# A tibble: 6 × 4
  lprice points country variety
   <dbl>  <dbl> <fct>   <fct>
1   2.71     87 Other   Other
2   2.64     87 US      Other
3   2.56     87 US      Other
4   4.17     87 US      Pinot Noir
5   2.71     87 Spain   Other
6   2.77     87 Italy   Other
Explanation
We create a new column lprice which is the logarithm of the price column.
We lump the country column into the top 4 most common countries and group the rest into “Other”.
We lump the variety column into the top 4 most common varieties and group the rest into “Other”.
We select only the lprice, points, country, and variety columns.
We remove any rows that contain missing values.
Finally, we display the first few rows of the resulting wino dataframe using the head function.
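The steps above can be sketched in base R on a tiny made-up dataset. This is only an illustration and not the pipeline that produced the output shown: the `toy` data frame, its values, and the `lump_top_n` helper are hypothetical, and the top 2 levels are kept instead of the top 4 so the effect is visible on eight rows.

```r
# Toy data standing in for the wine reviews (values are made up)
toy <- data.frame(
  price   = c(15, 14, 13, 65, 15, NA, 20, 30),
  points  = c(87, 87, 87, 87, 87, 88, 90, 91),
  country = c("US", "US", "US", "Spain", "Spain", "Italy", "France", "US")
)

# Hypothetical helper: keep the n most frequent values, relabel the rest "Other"
lump_top_n <- function(x, n) {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(x %in% keep, as.character(x), "Other"))
}

toy$lprice  <- log(toy$price)                   # natural log of price
toy$country <- lump_top_n(toy$country, 2)       # top 2 countries, rest "Other"
toy <- toy[, c("lprice", "points", "country")]  # keep only the modelling columns
toy <- na.omit(toy)                             # drop rows with missing values
head(toy)
```

The row with a missing price is dropped in the last step, and Italy and France are both relabelled as "Other".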
Caret
We now use a train/test split to evaluate the features.
Use the caret library to partition the wino dataframe into an 80/20 split.
Run a linear regression with bootstrap resampling.
Report RMSE on the test partition of the data.
set.seed(123)
trainIndex <- createDataPartition(wino$lprice, p = 0.8, list = FALSE)
wino_train <- wino[trainIndex, ]
wino_test <- wino[-trainIndex, ]
train_control <- trainControl(method = "boot", number = 100)
model <- train(lprice ~ ., data = wino_train, method = "lm", trControl = train_control)
predictions <- predict(model, wino_test)
rmse <- sqrt(mean((wino_test$lprice - predictions)^2))
rmse
[1] 0.4902949
Explanation
We set a seed for reproducibility.
We create a training index that partitions the wino dataframe into an 80/20 split.
We create training and testing datasets using the partition index.
We define the training control using bootstrap resampling with 100 iterations.
We train a linear regression model using the training data and the defined training control.
We make predictions on the test data using the trained model.
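We calculate the Root Mean Squared Error (RMSE) to evaluate the model’s performance on the test data. Because lprice is the natural log of price, the RMSE of about 0.49 is in log-price units, not in the original price units.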
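To make the mechanics of these steps concrete, here is a minimal base-R sketch of the same idea without caret, using the built-in mtcars data as a stand-in for wino. It is an assumption-laden illustration, not caret's implementation: the split uses simple random sampling (createDataPartition stratifies on the outcome), the formula `mpg ~ wt + hp` is arbitrary, and each bootstrap fit is scored on the rows left out of that resample.

```r
set.seed(123)

# 80/20 split by simple random sampling (stand-in for createDataPartition)
n <- nrow(mtcars)
train_rows <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Bootstrap resampling: refit on rows drawn with replacement,
# then score each fit on the out-of-bag rows
boot_rmse <- replicate(100, {
  idx <- sample(nrow(train), replace = TRUE)
  oob <- setdiff(seq_len(nrow(train)), idx)
  fit_b <- lm(mpg ~ wt + hp, data = train[idx, ])
  sqrt(mean((train$mpg[oob] - predict(fit_b, newdata = train[oob, ]))^2))
})
mean(boot_rmse)  # resampled estimate of RMSE

# Final model on the full training set, evaluated on the held-out test set
fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

The bootstrap average estimates out-of-sample error using only the training rows, while the final `rmse` is the figure that corresponds to the test-partition RMSE reported above.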
Variable selection
We now graph the importance of the 10 features.
plot(varImp(model, scale = FALSE))
Explanation
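We use the varImp function from the caret package to calculate the importance of each feature in the model. For a linear model, this is the absolute value of each coefficient's t-statistic, and scale = FALSE reports those raw values instead of rescaling them to a 0 to 100 range.
We plot the variable importance using the plot function, which helps us visualize the relative importance of each feature in predicting the target variable lprice.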
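As a rough cross-check of what the plot shows, the unscaled importance of a linear model can be reproduced from the coefficient table of a plain `lm` fit. The sketch below uses the built-in mtcars data and an arbitrary formula as stand-ins. It assumes, per caret's documented behavior for linear models, that importance is the absolute t-statistic.

```r
# Fit an ordinary linear model on a built-in dataset (stand-in for wino)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# t-statistics for every term except the intercept
t_stats <- summary(fit)$coefficients[-1, "t value"]

# Importance as |t|, largest first
importance <- sort(abs(t_stats), decreasing = TRUE)
importance
```

With factor predictors such as country and variety, each non-reference level gets its own coefficient and therefore its own bar in the plot.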